Captured from the BIX network:
==========================
microbytes/features #243, from microbytes, 11743 chars, Mon Jan 22 16:33:53 1990
--------------------------
TITLE: FIRST IMPRESSION: Motorola's New 68040 Microprocessor
by Tom Thompson
---------------------------
This new CISC microprocessor
offers RISC performance
---------------------------
Motorola has officially unwrapped its newest 32-bit
microprocessor, the 68040. Manufactured with 0.8-micron
high-speed CMOS technology, the 68040 packs 1.2 million
transistors on a single silicon die. With 900,000 more
transistors to work with than the roughly 300,000 in a
68030, the 68040's designers added new features
and boosted performance. New features include the following:
-- Optimized 68030 integer unit. While retaining object-code
compatibility with previous 68000-family processors, the IU
has been optimized to execute instructions in fewer clock
cycles (i.e., run faster). The claimed boost in performance is
three times that of a 68030.
-- Integral FPU. The 68020 and 68030 require external FPU
coprocessor chips to handle floating-point math. The 68040,
however, has an FPU built into it, giving it the power to do
serious number crunching. The FPU's data types are
compatible with the ANSI/IEEE 754 standard for binary
floating-point math, and its instruction set is object
code-compatible with Motorola's 68881/68882 FPUs. Like
the IU, the 68040's on-chip FPU has been optimized to
execute frequently used instructions using fewer clock
cycles. The claimed performance boost is 10 times that of a
68882.
-- Large caches. Processor accesses to the system bus are
minimized by storing the most recently used set of
instructions or data in on-chip, 4K-byte caches. Both caches
operate independently but can be accessed at the same time.
Bus snoop logic is used to maintain cache coherency (i.e., it
ensures that the cache's contents match those parts of
memory corresponding to the cache). The bus snooper's design
is fine-tuned to support multiprocessor systems where one
or more bus masters or 68040s might share the same section
of memory.
-- Separate memory units for instructions and data. Each
memory unit consists of a memory management unit, a cache
controller, and bus snoop logic. The MMUs use a subset of the
68030's MMU instruction set. Both memory units function
independently of each other to improve processor throughput.
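To picture what the snoop logic accomplishes, here is a toy write-invalidate sketch in Python. The class names and the write-through policy are illustrative assumptions for clarity, not the 68040's actual bus protocol:

```python
class SnoopingCache:
    """Toy cache on a shared bus: a write by any master invalidates
    the matching line in every other cache (write-invalidate)."""
    def __init__(self):
        self.lines = {}                    # address -> cached value

    def read(self, bus, addr):
        if addr not in self.lines:         # miss: fetch from memory
            self.lines[addr] = bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, bus, addr, value):
        self.lines[addr] = value
        bus.memory[addr] = value           # write-through, for simplicity
        bus.broadcast(self, addr)          # other caches snoop the write

    def snoop(self, addr):
        self.lines.pop(addr, None)         # drop the now-stale copy

class Bus:
    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast(self, writer, addr):
        for cache in self.caches:
            if cache is not writer:
                cache.snoop(addr)

bus = Bus()
a, b = SnoopingCache(), SnoopingCache()
bus.caches = [a, b]
bus.memory[0x100] = 1
a.read(bus, 0x100); b.read(bus, 0x100)   # both caches now hold the line
a.write(bus, 0x100, 2)                   # b's stale copy is invalidated
```

After the write, cache b must re-read the line from memory and so sees the new value; that is the coherency guarantee the snooper provides.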
The 68040 ships with an initial clock speed of 25 MHz;
higher speeds are to be available in the future, Motorola says.
The 68040 comes in a 179-pin grid-array package. With the
elimination of coprocessor function lines (now that the MMU
and FPU are consolidated onto the processor) and the addition
of snoop control lines, the 68040 is not pin-compatible with
the 68030.
Because of the 68040's software compatibility with its
predecessors, it can tap into the existing software base of
680x0 applications. It does so while eliminating a component
(the external FPU) from a computer's design and improving
performance at the same time. In fact, the 68040 executes
instructions at an average rate of nearly one per clock cycle --
the same as a RISC processor.
Fine-Tuned for Performance
The 68040 was built on the firm foundation of its
predecessors. The design team used the experience garnered
from developing earlier processors to aid in optimizing the
throughput of the 040.
The 040 was designed from the ground up, Motorola engineers
said. It incorporates a high degree of parallelism using a
number of internal buses. An internal Harvard architecture
gives the processor full access to both instructions and data.
Both the IU and FPU have separate pipelines and can operate
concurrently. For example, the FPU can perform
floating-point instructions independently of the IU. Each
stream (instructions or data) has its own dedicated cache and
memory unit that function independently of each other. A smart bus
controller assigns priorities to bus traffic to and from the
caches.
There were several key areas where Motorola was able to
boost performance. The first was in reducing the clock cycles
needed to execute certain instructions. The next was to
ensure that the processor funnels instructions and data into
itself quickly and constantly, lest it stall while waiting on
information. The processor then gets its results back into the
system without interfering with incoming information.
Finally, the processor stays off the system bus to a greater
extent than other processor designs do, leaving the bus free
for DMA transfers and other bus masters.
CISC with the Speed of RISC
The IU was optimized so that high-usage instructions execute
in fewer clock cycles, particularly branch instructions.
Motorola said it performed thousands of code traces using
real-world applications to determine which instructions
were used most often. The IU consists of six stages: instruction
prefetch, decode, effective address calculation, operand
fetch, execution, and writeback (i.e., the result is written to
either a register or to memory). Each stage works
concurrently on the instruction pipeline. Dual prefetch and
decode units deal with the branch instructions: One set
processes the instruction taken on the branch, and another
processes the instruction not taken. In this way, no matter
what the outcome, the IU has the next instruction decoded and
ready to go without seriously disrupting the pipeline. This
complex design has a big payoff: Motorola has determined
that the average instruction takes 1.3 clock cycles to
execute. The ability to execute an instruction once per clock
cycle is the performance edge of RISC processors -- yet the
68040's IU accomplishes the same goal while executing
complex-instruction-set computer (CISC) instructions.
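The arithmetic behind that edge is straightforward. A quick sketch using only the article's own figures (25 MHz and 1.3 cycles per instruction):

```python
def instructions_per_second(clock_hz: float, cpi: float) -> float:
    """Sustained instruction rate for a given clock and cycles-per-instruction."""
    return clock_hz / cpi

# The 68040's quoted 1.3 CPI at 25 MHz, versus an ideal 1.0-CPI RISC part
mips_040 = instructions_per_second(25e6, 1.3) / 1e6    # about 19.2 MIPS
mips_risc = instructions_per_second(25e6, 1.0) / 1e6   # 25.0 MIPS
```

At 1.3 cycles per instruction the 68040 sustains roughly 19 million instructions per second, within striking distance of a same-clock RISC design.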
The FPU adds 11 registers to the 68040 register set: Eight of
them are 80-bit floating-point registers, and three are
status, control, and instruction address registers. The FPU
has a three-stage execution unit, and, like the IU, each stage
operates concurrently. Load and store instructions (FMOVE)
can be performed during other arithmetic operations, and a
64- by 8-bit hardware multiplication unit speeds many
calculations. However, the FPU only implements a subset of
the 68882 instructions on-chip. The transcendental
(trigonometric and exponential) functions are emulated in
software via a software trap. But Motorola claims that even
these instructions should execute 25% to 100% faster on a
25-MHz 68040 than on a 33-MHz 68882 FPU.
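Such a trap handler ultimately boils down to evaluating a series in software. As an illustration only -- Motorola's actual emulation routines are not described in the article -- here is a Taylor-series sine of the general flavor a handler might use:

```python
import math

def sin_series(x: float, terms: int = 10) -> float:
    """Approximate sin(x) with its Taylor series -- the kind of routine a
    software trap might run when FSIN is not implemented in hardware."""
    result, term = 0.0, x
    for n in range(terms):
        result += term
        # next term: multiply by -x^2 / ((2n+2)(2n+3))
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return result
```

For arguments reduced into a small range, a handful of terms already matches the hardware's precision, which is why trapping to software costs less than it might first appear.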
Boosting Throughput
In the area of throughput, each stream is managed by a
separate memory unit that uses an MMU for
logical-to-physical address translations during bus accesses.
These MMUs support demand-paged virtual memory. Both
MMUs have a four-way set-associative address translation
cache (ATC) with 64 entries (versus 22 entries for the 68030).
The ATCs reduce processor overhead by storing the most
recent address translations. When an address translation is
required, the ATC is searched, and if it contains the address,
it is used immediately. Otherwise, a combination of
high-speed hardware logic and microcode searches the
translation tables located in main memory.
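The lookup sequence just described -- check the ATC first, walk the tables in memory only on a miss -- can be sketched as a toy model, with a Python dictionary standing in for the real translation tables:

```python
PAGE_SIZE = 4096   # one of the 68040's two supported page sizes

class ATC:
    """Toy address-translation cache: a hit returns the cached mapping;
    a miss falls back to a (simulated) table walk in main memory."""
    def __init__(self, page_table):
        self.page_table = page_table   # logical page -> physical page
        self.entries = {}              # the cache of recent translations
        self.walks = 0                 # how many slow table walks occurred

    def translate(self, logical_addr):
        page, offset = divmod(logical_addr, PAGE_SIZE)
        if page not in self.entries:                    # ATC miss
            self.walks += 1
            self.entries[page] = self.page_table[page]  # table walk
        return self.entries[page] * PAGE_SIZE + offset

atc = ATC({0: 7, 1: 3})
atc.translate(0x123)   # miss: one table walk
atc.translate(0x456)   # hit: same page, no walk needed
```

Two accesses to the same page cost only one walk; the second translation comes straight out of the cache.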
Like the FPU, these MMUs implement a subset of the 68030's
MMU instruction set. Gone are the PLOAD and PMOVE
instructions, because enhanced existing instructions made
them superfluous. Also, only two memory page sizes are
supported, 4K bytes and 8K bytes, whereas the 68030 MMU
supported eight page sizes ranging from 256 bytes to 32K
bytes. A design trade-off was made here: a performance gain
was possible by supporting only the two most common page
sizes. In
any case, this change impacts only operating-system code,
since MMU instructions aren't normally used by applications.
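The page size determines how a logical address splits into page number and byte offset: 12 offset bits for 4K pages, 13 for 8K. A short sketch (the sample address is arbitrary):

```python
def split(addr: int, page_size: int):
    """Split a logical address into (page number, byte offset)."""
    offset_bits = page_size.bit_length() - 1   # 12 for 4K, 13 for 8K
    return addr >> offset_bits, addr & (page_size - 1)

# The same address maps differently under the two supported sizes.
assert split(0x00401A2C, 4096) == (0x401, 0xA2C)    # 4K pages
assert split(0x00401A2C, 8192) == (0x200, 0x1A2C)   # 8K pages
```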
The two on-chip 4K caches improve processor throughput in two
ways: They keep the pipelines filled and minimize system bus
accesses. To see how this is done, you must examine the
structure of the cache. Each is a four-way set-associative
cache composed of 64 sets of four lines. A line consists of
four longwords, or 16 bytes. Cache lines are read or written
rapidly using burst
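That geometry fixes how an address is carved up: four bits select the byte within a 16-byte line, six bits select one of the 64 sets, and the remaining bits form the tag compared against all four ways. A sketch of the split (the field layout is inferred from the sizes above, not taken from Motorola documentation):

```python
LINE_BYTES = 16   # 4 longwords per line
SETS = 64         # 4K bytes / (4 ways * 16 bytes per line)

def cache_fields(addr: int):
    """Split an address into (tag, set index, byte-in-line) for a
    4K-byte, four-way set-associative cache with 16-byte lines."""
    byte = addr & (LINE_BYTES - 1)       # low 4 bits
    index = (addr >> 4) & (SETS - 1)     # next 6 bits pick the set
    tag = addr >> 10                     # remainder is the tag
    return tag, index, byte

# Addresses 1K bytes apart land in the same set and compete for its four ways.
assert cache_fields(0x0000)[1] == cache_fields(0x0400)[1]
```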